
    DNA Hash Pooling and its Applications

    In this paper we describe a new technique for the comparison of populations of DNA strands. Comparison is vital to the study of ecological systems, at both the micro and macro scales. Existing methods make use of DNA sequencing and cloning, which can prove costly and time-consuming, even with current sequencing techniques. Our overall objective is to address questions such as: (i) (Genome detection) Is a known genome sequence present, at least in part, in an environmental sample? (ii) (Sequence query) Is a specific fragment sequence present in a sample? (iii) (Similarity discovery) How similar in terms of sequence content are two unsequenced samples? We propose a method involving multiple filtering criteria that results in "pools" of DNA of high or very high purity. Because our method is similar in spirit to hashing in computer science, we call it DNA hash pooling. To illustrate this method, we describe protocols using pairs of restriction enzymes. The in silico empirical results we present reflect a sensitivity to experimental error. Our method will normally be performed as a filtering step prior to sequencing in order to reduce the amount of sequencing required (generally by a factor of 10 or more). Even as sequencing becomes cheaper, an order of magnitude remains important.
    Comment: 14 pages, 3 figures. To appear in the International Journal of Nanotechnology and Molecular Computation. Improved background, analysis and references.
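    The paper's protocols are laboratory procedures; as a loose in-silico illustration of the hashing analogy only (function names, sequences, and the cut model are hypothetical simplifications, not the authors' protocol), one can treat the fragment-length signature produced by a pair of restriction sites as a hash key that groups sequences into pools:

    ```python
    from collections import defaultdict

    def digest(seq, site):
        """Crude in-silico digest: cut seq immediately before each
        occurrence of a recognition site; return fragment lengths."""
        cuts = [0]
        i = seq.find(site)
        while i != -1:
            if i > 0:
                cuts.append(i)
            i = seq.find(site, i + 1)
        cuts.append(len(seq))
        return tuple(cuts[j + 1] - cuts[j] for j in range(len(cuts) - 1))

    def hash_pool_key(seq, site_a, site_b):
        """Signature under a pair of enzymes: two sorted fragment-length tuples."""
        return (tuple(sorted(digest(seq, site_a))),
                tuple(sorted(digest(seq, site_b))))

    def pool(seqs, site_a, site_b):
        """Group sequences whose signatures collide, hash-table style."""
        pools = defaultdict(list)
        for s in seqs:
            pools[hash_pool_key(s, site_a, site_b)].append(s)
        return dict(pools)
    ```

    Sequences with identical signatures land in the same pool, mirroring how hashing buckets keys; the real protocol's purity comes from physical filtering steps that this sketch does not model.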

    Fast parallel algorithms for the unit cost editing distance between trees

    1. Problem. Ordered labeled trees are trees whose nodes are labeled and in which the left-to-right order among siblings is significant. We consider the distance between two trees to be the minimum number of edit operations (insert, delete, and modify) necessary to transform one tree into another. We present three algorithms to find the distance. The first algorithm is a simple dynamic programming algorithm based on a postorder traversal whose complexity improves upon the best previously published algorithm, due to Tai (JACM, 1979). The second and third algorithms are parallel algorithms based on the application of suffix trees to the comparison problem. The cost of executing these algorithms is a monotonic increasing function of the distance between the two trees.
    2. Results. Let trees T1 and T2 have L1 and L2 levels respectively. Let k be the actual distance between T1 and T2. Let N be min(|T1|, |T2|). The asymptotic running times (assuming a concurrent-read concurrent-write parallel random access machine) are |T1| × |T2| × L1² × L2² for Tai's algorithm and |T1| × |T2| × L1 × L2 for Algorithm 1.
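    The standard recurrence underlying this family of algorithms operates on the rightmost roots of two forests: an optimal mapping either deletes one rightmost root, inserts the other, or matches them and recurses on their child forests. A minimal memoized sketch (tree encoding and names are mine; this naive version does not achieve the paper's sequential or parallel bounds):

    ```python
    from functools import lru_cache

    # A tree is a (label, children) pair; children is a tuple of trees.
    def size(forest):
        """Number of nodes in a forest (a tuple of trees)."""
        return sum(1 + size(children) for _, children in forest)

    @lru_cache(maxsize=None)
    def forest_dist(f1, f2):
        """Unit-cost edit distance between two ordered forests."""
        if not f1:
            return size(f2)                      # insert every remaining node
        if not f2:
            return size(f1)                      # delete every remaining node
        (l1, c1), (l2, c2) = f1[-1], f2[-1]      # rightmost tree of each forest
        return min(
            forest_dist(f1[:-1] + c1, f2) + 1,   # delete root l1 (children rise)
            forest_dist(f1, f2[:-1] + c2) + 1,   # insert root l2
            forest_dist(f1[:-1], f2[:-1])        # match the rightmost roots:
                + forest_dist(c1, c2)            #   recurse on child forests,
                + (l1 != l2),                    #   +1 if a relabel is needed
        )

    def tree_edit_distance(t1, t2):
        return forest_dist((t1,), (t2,))
    ```

    For example, relabeling one leaf costs 1, and deleting a two-leaf subtree pair costs 2; the dynamic program in the paper organizes these same subproblems along a postorder traversal.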

    A Collaborative Approach to Computational Reproducibility

    Although a standard in natural science, reproducibility has been only episodically applied in experimental computer science. Scientific papers often present a large number of tables, plots and pictures that summarize the obtained results, but then only loosely describe the steps taken to derive them. Not only can the methods and the implementation be complex, but their configuration may also require setting many parameters and/or depend on particular system configurations. While many researchers recognize the importance of reproducibility, the challenge of making it happen often outweighs the benefits. Fortunately, a plethora of reproducibility solutions have recently been designed and implemented by the community. In particular, packaging tools (e.g., ReproZip) and virtualization tools (e.g., Docker) are promising solutions towards facilitating reproducibility for both authors and reviewers. To address the incentive problem, we have implemented a new publication model for the Reproducibility Section of the Information Systems Journal. In this section, authors submit a reproducibility paper that explains in detail the computational assets from a previously published manuscript in Information Systems.

    Debugging Machine Learning Pipelines

    Machine learning tasks entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous or uninformative outputs, the pipeline may fail or produce incorrect results. Inferring the root cause of failures and unexpected behavior is challenging, usually requiring much human thought, and is both time-consuming and error-prone. We propose a new approach that makes use of iteration and provenance to automatically infer the root causes and derive succinct explanations of failures. Through a detailed experimental evaluation, we assess the cost, precision, and recall of our approach compared to the state of the art. Our source code and experimental data will be available for reproducibility and enhancement.
    Comment: 10 pages.
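    As a loose illustration of iterative root-cause inference (this is not the authors' algorithm; it is a minimal single-fault substitution sketch with hypothetical names), one can rerun a linear pipeline while swapping one step at a time for a known-good counterpart and see which swap repairs the output:

    ```python
    def run(steps, x):
        """Run a linear pipeline: feed x through each step in order."""
        for step in steps:
            x = step(x)
        return x

    def root_cause(steps, reference_steps, x, is_ok):
        """Swap each suspect step for its known-good counterpart; the
        first swap that repairs the final output implicates that step.
        Assumes a single faulty step in a linear pipeline."""
        for i in range(len(steps)):
            trial = steps[:i] + [reference_steps[i]] + steps[i + 1:]
            if is_ok(run(trial, x)):
                return i
        return None  # no single substitution fixes the run
    ```

    Provenance makes this kind of search cheap in practice by letting the debugger reuse the recorded intermediate outputs instead of recomputing every prefix of the pipeline.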

    Constellation Queries over Big Data

    A geometrical pattern is a set of points with all pairwise distances (or, more generally, relative distances) specified. Finding matches to such patterns has applications to spatial data in seismic, astronomical, and transportation contexts. For example, a particularly interesting geometric pattern in astronomy is the Einstein cross, an astronomical phenomenon in which a single quasar is observed as four distinct sky objects (due to gravitational lensing) when captured by Earth-based telescopes. Finding such crosses, as well as other geometric patterns, is a challenging problem, as the potential number of sets of elements that compose shapes is exponentially large in the size of the dataset and the pattern. In this paper, we denote geometric patterns as constellation queries and propose algorithms to find them in large data applications. Our methods combine quadtrees, matrix multiplication, and unindexed join processing to discover sets of points that match a geometric pattern within some additive factor on the pairwise distances. Our distributed experiments show that the choice of composition algorithm (matrix multiplication or nested loops) depends on the freedom introduced in the query geometry through the distance additive factor. Three clearly identified blocks of threshold values guide the choice of the best composition algorithm. Finally, solving the problem for relative distances requires a novel continuous-to-discrete transformation. To the best of our knowledge, this paper is the first to investigate constellation queries at scale.
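    The matching condition itself is simple to state: an assignment of dataset points to pattern points matches when every pairwise distance agrees with the pattern's within the additive factor. A brute-force sketch of just that condition (names are mine; this enumerates all assignments and has none of the paper's quadtree or matrix-multiplication machinery, so it does not scale):

    ```python
    import math
    from itertools import permutations

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def constellation_matches(points, pattern, eps):
        """Return every ordered assignment of dataset points to pattern
        points whose pairwise distances all agree with the pattern's
        within the additive factor eps."""
        k = len(pattern)
        target = {(i, j): dist(pattern[i], pattern[j])
                  for i in range(k) for j in range(i + 1, k)}
        hits = []
        for cand in permutations(points, k):
            if all(abs(dist(cand[i], cand[j]) - d) <= eps
                   for (i, j), d in target.items()):
                hits.append(cand)
        return hits
    ```

    For a pattern with symmetries (e.g., an isoceles triangle), several permutations of the same point set match, which is one reason pattern size and the additive factor both blow up the search space that the paper's pruning techniques attack.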